improving heuristic reading order #206

bertsky · 2025-10-20T15:45:09Z

WIP, starting off with regressions from 0.5.0 and old issues (IndexError etc)

TODO:

modify return_boxes_of_images_by_order_of_reading_new such that it becomes mildly recursive, in order to avoid cutting through regions: if (for some y slice) some columns have much higher peaks than others, then pick those first and search for new y splitters within the others

(also, simplify `run` and separate `run_single`)

extend horizontal separators to full img width if they do not overlap any other regions (only as regards to returned `splitter_y` result, but without changing returned separators mask)

regarding `splitter_y` result, for headings, instead of cutting right through them via center line, add their toplines and baselines as if they were horizontal separators
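A minimal sketch of the difference, assuming each heading is represented only by its vertical extent `(y_min, y_max)` and that the bounding box top and bottom approximate the topline and baseline. The names `headings`, `center_splitters` and `top_base_splitters` are hypothetical and not taken from the eynollah code:

```python
# Hypothetical vertical extents (y_min, y_max) of two headings
headings = [(100, 160), (500, 540)]

# Old behaviour (as described): one splitter through the centre of each heading
center_splitters = [(y_min + y_max) // 2 for y_min, y_max in headings]

# New behaviour (as described): topline and baseline each act as a splitter,
# so the heading itself is never cut in half
top_base_splitters = sorted(y for extent in headings for y in extent)
```

With the second variant, each heading ends up in its own y slice between its two splitters.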

- enumeration instead of indexing
- array instead of list operations
- add better plotting (but commented out)

- when handling lines without mother, and the biggest line already accounts for all columns, but some lines are too close to the top and therefore must be removed, avoid invalidating the `biggest` index (which caused `IndexError`)
- remove try-catch (now unnecessary)
- array instead of list operations

simplify and document

- simplify
- rename identifiers to make readable:
  - `y_sep` → `y_mid` (because the cy gets passed)
  - `y_diff` → `y_max` (because the ymax gets passed)
- array instead of list operations
- add docstring and in-line comments
- return (zero-length) numpy array instead of empty list

when calculating `reading_order_type`, upper limit on column range (`x_end`) needs to be `+1` here as well
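The off-by-one can be illustrated with a small sketch. It assumes that `x_start` and `x_end` are inclusive column indices, which is an assumption here (the convention inside `return_x_start_end_mothers_childs_and_type_of_reading_order` is not shown in this thread). The helper `columns_covered` is hypothetical:

```python
from typing import List

def columns_covered(x_start: int, x_end: int, num_cols: int) -> List[bool]:
    """Hypothetical helper: mark the columns spanned by a separator.

    Assumes x_start and x_end are both inclusive, so the slice's
    upper limit has to be x_end + 1.
    """
    mask = [False] * num_cols
    span = range(x_start, x_end + 1)  # without the +1, column x_end is lost
    for col in span:
        mask[col] = True
    return mask

mask = columns_covered(1, 3, 5)
```

Dropping the `+1` would silently shorten every span by one column, which changes which separators count as covering the full column range.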

- array instead of list operations
- return array of index pairs instead of list objects

- array instead of list operations
- add better plotting (but commented out)
- add more debug printing (but commented out)
- add more inline comments for documentation
- rename identifiers to make more readable:
  - `cy_hor_diff` → `y_max_hor_some` (because the ymax gets passed)
  - `lines` → `seps`
  - `y_type_2` → `y_mid`
  - `y_diff_type_2` → `y_max`
  - `y_lines_by_order` → `y_mid_by_order`
  - `y_lines_without_mother` → `y_mid_without_mother`
  - `y_lines_with_child_without_mother` → `y_mid_with_child_without_mother`
  - `y_column` → `y_mid_column`
  - `y_column_nc` → `y_mid_column_nc`
  - `y_all_between_nm_wc` → `y_mid_between_nm_wc`
  - `lines_so_close_to_top_separator` → `seps_too_close_to_top_separator`
  - `y_in_cols` and `y_down` → `y_mid_next`
- use `pairwise()` `nc_top:nc_bot` instead of `i_c` indexing
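The last item can be sketched as follows. This is only an illustration of the idiom: `rows` and `bounds` are made-up data, and reading `nc_top` and `nc_bot` as consecutive slice boundaries is an assumption about the actual code.

```python
try:
    from itertools import pairwise  # available since Python 3.10
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        """Fallback for older Python versions: (a, b), (b, c), ..."""
        first, second = tee(iterable)
        next(second, None)
        return zip(first, second)

rows = list(range(10))   # hypothetical items to be sliced
bounds = [0, 3, 7, 10]   # hypothetical consecutive slice boundaries

# Index-based variant, with manual i_c arithmetic
chunks_indexed = [rows[bounds[i_c]:bounds[i_c + 1]]
                  for i_c in range(len(bounds) - 1)]

# Same result via pairwise(), using nc_top:nc_bot directly as the slice
chunks_pairwise = [rows[nc_top:nc_bot] for nc_top, nc_bot in pairwise(bounds)]
```

Both variants yield the same chunks; the `pairwise()` form avoids the `i_c + 1` lookup and the explicit length check.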

when y slice (`top:bot`) is not a significant part of the page, viz. less than 22% (as in `find_number_of_columns_in_document`), avoid forcing `find_num_col` to reach `num_col_classifier` (allows large headers not to be split up and thus better ordered)
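A rough sketch of this rule, under stated assumptions: the 22% is taken relative to the page height, and "forcing" is simplified to taking the maximum of the locally detected column count and `num_col_classifier`. The function `columns_for_slice` and its parameters are hypothetical and do not mirror the signature of `find_num_col`:

```python
def columns_for_slice(top: int, bot: int, page_height: int,
                      num_col_detected: int, num_col_classifier: int,
                      min_fraction: float = 0.22) -> int:
    """Hypothetical helper: decide the column count for one y slice.

    Small slices keep their locally detected column count; only slices
    covering at least min_fraction of the page are pushed up to the
    page-level classifier result.
    """
    if (bot - top) < min_fraction * page_height:
        # e.g. a large header spanning several columns stays in one piece
        return num_col_detected
    return max(num_col_detected, num_col_classifier)

header_cols = columns_for_slice(0, 100, 1000, num_col_detected=1, num_col_classifier=3)
body_cols = columns_for_slice(100, 1000, 1000, num_col_detected=2, num_col_classifier=3)
```

Here the 100 px header slice keeps its single column, while the body slice is raised to the classifier's three columns.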

(by removing unnecessary conditional)

- use array instead of list operations
- rename identifiers:
  - `pixel` → `label`
  - `line` → `sep`

- drop connected components analysis to test overlaps between horizontal separators and (horizontal) neighbours (introduced in ab17a92)
- instead of converting headings to topline and baseline during `find_number_of_columns_in_document` (introduced in 9f1595d7), add them to the matrix unchanged, but mark as extra type (besides horizontal and vertical separators)
- convert headings to toplines and baselines no earlier than in `return_boxes_of_images_by_order_of_reading_new`
- for both headings and horizontal separators, if they already span multiple columns, check if they would overlap (horizontal) neighbours by looking at successively larger (left and right) intervals of columns (and pick the largest elongation which does not introduce any overlaps)
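The elongation step in the last item might look roughly like the sketch below. It assumes that, for the y range of a given separator or heading, one already knows per column whether a neighbouring region occupies it (`occupied`). The function `elongate_span` and this representation are hypothetical, and the real code works on pixel data rather than a per-column flag list:

```python
from typing import List, Tuple

def elongate_span(start: int, end: int, occupied: List[bool]) -> Tuple[int, int]:
    """Hypothetical helper: widen a column span [start, end) without overlaps.

    Grows the span one column at a time to the left and to the right,
    stopping on each side at the first column occupied by a neighbour.
    Spans covering fewer than two columns are returned unchanged.
    """
    if end - start < 2:
        return start, end
    new_start, new_end = start, end
    while new_start > 0 and not occupied[new_start - 1]:
        new_start -= 1
    while new_end < len(occupied) and not occupied[new_end]:
        new_end += 1
    return new_start, new_end

# A separator over columns 1..2; column 3 holds a neighbouring region
widened = elongate_span(1, 3, [False, False, False, True, False])
```

Growing each side independently gives the largest contiguous span that stays free of neighbours, which allows partial elongations when the full page width is blocked.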

bertsky · 2025-10-25T12:01:35Z

Sorry for the force-push! I had accidentally rebased back to a2a06a8 (which now became cd35241).

Still have not addressed the big TODO (which is coming shortly), but found some more useful changes along the way:

- b2a79cc was clearly a bug
- 66a0e55 is an improvement: `find_num_col` should not enforce the overall `num_col_classifier` result for every part of the page, but rather (as in `find_number_of_columns_in_document`) only for the big parts, so there would still be room for sections with headings stretching multiple columns
- 19b2c3f improves on the previous change to elongate horizontal separators if that does not introduce any overlap: this is now done later, when the column split is already determined, so one can also try partial spans; it now covers headings as well, but single-column separators/headings are exempt

There are also lots of new plotting directives (commented out). I'll upload some explanatory images next.

Robert Sachunsky added 21 commits October 20, 2025 17:40

binarization: add option --overwrite, skip existing outputs

086c188

(also, simplify `run` and separate `run_single`)

find_num_cols: re-sort peaks when cutting n-best num_col_classifier

184927f

find_num_col: simplify, add better plotting (but commented out)

48761c3

order_of_regions: filter out-of-image peaks

c43a825

order_of_regions: add better plotting (but commented out)

d3d599b

find_number_of_columns_in_document: simplify, rename line→seps

542d38a

find_number_of_columns_in_document: improve splitter rule

5a0e4c3

extend horizontal separators to full img width if they do not overlap any other regions (only as regards to returned `splitter_y` result, but without changing returned separators mask)

find_number_of_columns_in_document: split headings at top+baseline

cd35241

regarding `splitter_y` result, for headings, instead of cutting right through them via center line, add their toplines and baselines as if they were horizontal separators

return_boxes_of_images_by_order_of_reading_new: simplify

7c3e418

- enumeration instead of indexing
- array instead of list operations
- add better plotting (but commented out)

return_x_start_end_mothers_childs_and_type_of_reading_order: fix+1

b2a79cc

when calculating `reading_order_type`, upper limit on column range (`x_end`) needs to be `+1` here as well

find_number_of_columns_in_document: simplify

acee4c1

contours_in_same_horizon: simplify

5d15941

- array instead of list operations
- return array of index pairs instead of list objects

find_num_col: add better plotting (but commented out)

6cc5900

return_boxes_of_images_by_order_of_reading_new: indent

3ebbc2d

(by removing unnecessary conditional)

delete_separator_around: simplify, eynollah: identifiers

a2a9fe5

- use array instead of list operations
- rename identifiers:
  - `pixel` → `label`
  - `line` → `sep`

return_boxes_of_images_by_order_of_reading_new: change arg order

3367462

bertsky force-pushed the ro-fixes branch from f45df87 to 19b2c3f · October 25, 2025 11:36
